1.  Introduction

The  current  state  of  technology  implies  that  memories  which 
include  many  cells  must  be  partitioned  into  a  number  of  modules  each 
containing  many  cells;  where,  only  one  cell  (or  a  small  number  of 
cells)  of  each  module  can  be  accessed  at  a  time.  For  more  on  this,  see 
Kuck  [Ku-77]  and  Gottlieb  et  al.  [GGKMRS-83].  On  the  other  hand,  many 
published  parallel  algorithms  are  designed  for  abstract  shared  memory 
models  of  parallel  computation,  where  the  processors  have  free  access 
to  each  cell  of  the  shared  memory  for  both  read  and  write  purposes.  An 
obvious  difficulty  arises  when  one  wants  to  simulate  these  algorithms 
on buildable machines. One approach is to require designers of
algorithms for abstract shared memory models of parallel computation to
limit, as much as possible, the size of the shared memory that the
algorithm uses. This is usually done in favor of more local
computations in which each processor accesses its own local memory
only. Kuck mentions several papers that practiced this ad-hoc
approach. Even in cases where such a limitation is possible, this
approach places an additional, undesirable burden on the designer.

Let  us  be  a  bit  more  precise.  Given  a  shared  memory  model  of 
parallel  computation  D  we  define  M(D)  to  be  the  model  of  computation 
which  is  derived  from  D  by  partitioning  the  shared  memory  of  D  into 
modules  so  that  no  more  than  one  cell  of  each  module  can  be  accessed  at 
a time. If there are several simultaneous requests for the same common
memory location in M(D), they are treated in the same way as in D. If
there are several simultaneous requests for different cells of the same
module, they are queued and served one at a time.
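
The queuing rule can be made concrete with a small sketch. It assumes,
hypothetically, that the assignment of cells to modules is given by a
function module_of (a name introduced here, not in the text) and that
simultaneous requests for one cell count as a single access. Under these
assumptions the number of time units M(D) needs for one cycle of D is
the largest number of distinct cells requested from any single module:

```python
from collections import defaultdict

def module_cycle_cost(requests, module_of):
    """Time units M(D) spends serving the requests of one cycle of D.

    requests  -- iterable of requested cell addresses (repeats allowed)
    module_of -- hypothetical function mapping a cell address to its module
    """
    cells_per_module = defaultdict(set)
    for cell in requests:
        # Assumption: requests for the same cell are merged (handled as in D).
        cells_per_module[module_of(cell)].add(cell)
    # Different cells of one module are queued and served one at a time.
    return max((len(cells) for cells in cells_per_module.values()), default=0)
```

For instance, with four modules and module_of = lambda c: c % 4, the
requests 0, 4 and 8 all fall into module 0 and are served one after the
other.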

The  granularity  problem  is  defined  as  the  problem  of  simulating  a 
cycle  of  D  by  M(D).  Automatic  solutions  for  the  general  case  where  we 
do  not  know  anything  about  the  cycle  to  be  simulated  are  discussed  in 
Mehlhorn  and  Vishkin  [MV-82].  They  suggest  a  multi-stage  approach  for 
attacking  the  granularity  problem.  We  mention  the  two  main  stages. 
The first stage, designed to keep us 'out of trouble' in the average
case, utilizes universal hashing in the simulating machine M(D). Instead
of using one specific hash function, M(D) itself picks a hash function at
random from an entire class of hash functions before each simulation of
an algorithm. This is shown to keep memory contention low.
The  idea  behind  the  second  stage  is  to  keep  several  copies  of  each 
memory  address  in  distinct  memory  modules.  This  idea,  in  conjunction 
with  fast  algorithms  for  picking  the  'right'  copy  of  each  requested 
address  is  shown  to  decrease  memory  contention  for  the  worst  case 
results  of  the  first  stage. 
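
As an illustration of the first stage only, the sketch below draws a
hash function at random from one standard universal class, the linear
functions ((a*x + b) mod q) mod m with q prime. This class is an
assumption made here for concreteness and is not necessarily the class
used in [MV-82]; the names pick_hash and max_contention are likewise
hypothetical.

```python
import random
from collections import Counter

PRIME = (1 << 61) - 1  # a Mersenne prime; addresses are assumed smaller than it

def pick_hash(num_modules, rng):
    """Draw h(x) = ((a*x + b) mod PRIME) mod num_modules at random."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda address: ((a * address + b) % PRIME) % num_modules

def max_contention(addresses, h):
    """Largest number of distinct requested addresses sent to one module."""
    loads = Counter(h(x) for x in set(addresses))
    return max(loads.values(), default=0)

if __name__ == "__main__":
    h = pick_hash(8, random.Random())
    # A stride-8 pattern would hit a single module under x mod 8.
    print(max_contention(range(0, 64, 8), h))
```

The point of drawing the function afresh before each simulation is that
no fixed access pattern is bad for every function of the class.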

The  main  result  of  the  present  paper  is  that  in  a  few  general 
cases  the  idea  of  dynamically  changing  locations  of  addresses  among 
modules  throughout  the  performance  of  an  algorithm  provides  a  solution 
for  the  granularity  problem  in  constant  time  utilizing  only  as  many 
modules  as  the  number  of  processors. 

2.  A  relation  between  models  of  parallel  computation 

The  main  model  of  parallel  computation  that  is  used  in  the  present 
paper  is  the  exclusive-read  exclusive-write  parallel  random-access 
machine (EREW PRAM). It employs p processors (RAMs) P_1,...,P_p that
operate  synchronously  in  parallel.  Each  processor  has  access  to  both  a 
shared  memory  of  size  N  and  its  private  local  memory.  Simultaneous 
access  of  more  than  one  processor  to  the  same  memory  location  is  not 
allowed.  At  each  cycle  a  processor  may  either  perform  an  operation 
that  relates  to  its  local  memory  or  read  from  a  shared  memory  address 
or  write  into  a  shared  memory  address.  The  convention  of  not  allowing 
simultaneous access by several processors to the same memory location
is used in [LPV-81]. This model is a member of a whole family of
shared-memory,  parallel  RAM  models  of  computation.  We  refer  the  reader 
to  [SV-82]  for  a  formal  definition  of  these  models  including  the  list 
of  operations  they  allow. 

A second model of computation that we employ is the
module-parallel-machine (MPM). It employs r processors R_1,R_2,...,R_r
and is similar to the EREW PRAM with the following exception. The L
cells of the shared memory are partitioned among m modules. Only one
cell of each module can be accessed at any cycle of the MPM. In both
models of computation the program for each processor is located in its
local memory.

How  do  these  models  relate? 

1.  Every  algorithm  for  an  EREW  PRAM  that  employs  p  processors  and 
shared  memory  of  size  N  can  be  run  on  an  MPM  using  p  processors  and  N 
nonempty  modules.  This  trivial  observation  follows  readily  by 
employing one memory cell at each module of the MPM.

2. Suppose that we are given an algorithm for the MPM that employs p
processors and m shared memory modules; suppose that module i, 1 ≤ i ≤
m, contains N_i cells; suppose that m ≤ p and the algorithm runs in (at
most) T cycles. This algorithm can be simulated by the EREW PRAM in
O(T) cycles using p processors and shared memory of size m, where the
local memory that is used by processor P_i, 1 ≤ i ≤ m (resp. m < i ≤ p),
of the EREW PRAM is greater by N_i than (resp. is the same as) the local
memory of processor R_i of the MPM.

The rest of this section outlines how this is done. Processor P_i is
'responsible' for simulating the behavior of processor R_i, for
1 ≤ i ≤ p. In addition, processor P_i is 'responsible' for simulating
the behavior of module i, 1 ≤ i ≤ m. For the latter purpose each cell
of module i of the MPM is represented by a corresponding cell in the
local memory of processor P_i, 1 ≤ i ≤ m.

The simulation proceeds as follows. Each cycle t, 1 ≤ t ≤ T, of
the MPM is simulated by three pulses of the EREW PRAM denoted (t,1),
(t,2) and (t,3).

Pulse (t,1):
If R_i performed, at cycle t, an operation that relates to its
   local memory only
Then P_i does the same with respect to its local memory
Else If R_i performed a read instruction from cell j of shared memory
        module l
     Then P_i writes into shared memory cell l:
          'cell j is requested'
     Else If R_i wrote some value v into cell j of shared memory
             module l
          Then P_i writes into shared memory cell l:
               'write v into cell j'

Pulse (t,2):
(Only processors P_i, 1 ≤ i ≤ m, are active)
If shared memory cell i contains: 'cell j is requested'
Then P_i copies the contents of its local memory address which
     corresponds to cell j of module i of the MPM into common memory
     cell i
Else If shared memory cell i contains: 'write v into cell j'
     Then P_i copies v into its local address corresponding
          to cell j of module i

Pulse (t,3):
If R_i performed at cycle t a read instruction from a cell of module l
Then P_i reads the contents of common memory cell l


Proofs  of  correctness  of  this  simple  simulation  and  our  claims 
regarding  time  and  space  complexity  are  straightforward. 
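
The three pulses can be traced with a minimal sketch. The data layout
below (tuples for operations and messages, one dictionary per module,
0-based indices) is an assumption made for illustration, and the Python
loops stand in for processors that act in parallel. The sketch also
assumes, as the MPM requires, that at most one processor addresses any
given module in a cycle.

```python
def simulate_mpm_cycle(ops, shared, module_cells):
    """Simulate one MPM cycle by three EREW PRAM pulses.

    ops[i]          -- what R_i does: ('local',), ('read', l, j) or
                       ('write', l, j, v); modules and processors are 0-based
    shared[l]       -- the m shared cells of the EREW PRAM, one per module
    module_cells[l] -- dict kept in the local memory of P_l: cell j -> value
    Returns the value obtained by each processor (None if it did not read).
    """
    touched = [op[1] for op in ops if op[0] != 'local']
    assert len(touched) == len(set(touched)), "one access per module per cycle"

    # Pulse (t,1): every P_i posts its request in the cell of the target module.
    for op in ops:
        if op[0] == 'read':
            shared[op[1]] = ('requested', op[2])
        elif op[0] == 'write':
            shared[op[1]] = ('write', op[2], op[3])

    # Pulse (t,2): P_l, acting for module l, serves the posted request.
    # Cells holding None or an old ('value', ...) answer are left alone.
    for l, message in enumerate(shared):
        if message is None:
            continue
        if message[0] == 'requested':
            shared[l] = ('value', module_cells[l].get(message[1]))
        elif message[0] == 'write':
            module_cells[l][message[1]] = message[2]

    # Pulse (t,3): every reader picks up the answer from the shared cell.
    return [shared[op[1]][1] if op[0] == 'read' else None for op in ops]
```

Each MPM cycle costs exactly three pulses here, which is where the O(T)
bound of the simulation comes from.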

3.  Reducing  the  size  of  the  shared  memory 

Suppose  that  we  are  given  an  algorithm  which  is  designed  for  the 
EREW  PRAM,  employs  p  processors,  uses  N  shared  memory  locations  and 
runs in T cycles for some input. Suppose p << N. Question. Is it
possible to simulate this algorithm on an EREW PRAM that employs the
same number of processors and "significantly" fewer than N shared memory
cells, without increasing the running time "too much"?


The  following  fact  gives  some  hope:  Since  there  are  p  processors, 
no  more  than  p  shared  memory  addresses  may  be  accessed  at  the  same 
time. 

Before  we  proceed  to  our  main  theorem,  we  would  like  to  say  the 
following  regarding  the  most  general  case. 

Remark. In general, using a shared memory of size O(pT) should
suffice. The reason for this is that we can maintain all shared memory
cells which are actually being accessed in the course of the algorithm
in 2-3 trees. A processor may initialize only one cell at a time.
Therefore, the number of shared memory cells that can be initialized is
O(pT). The paper [PW-83] shows how to perform the search and
insertion operations that may be required for the simulation of one
cycle of the algorithm in O(log pT) time of the simulating (EREW PRAM)
machine.

Main theorem. Let S be a program for an EREW PRAM which is designated
for some set of inputs I. Suppose S uses p processors, N shared memory
locations, local memories of sizes m_1,m_2,...,m_p of respective
processors and runs in at most T cycles for each input in I. Assume
that for each cycle t, 1 ≤ t ≤ T, each of the p processors and all
inputs in I there is at most one common memory address that can be
accessed by this processor at this cycle.

Then, a program S' for an EREW PRAM can be constructed from S such that
S' simulates S for each input in I using p processors, only p shared
memory locations, m_i + N/p + O(T) (1 ≤ i ≤ p) local memory locations
of respective processors and O(T) pulses.

Before  we  proceed  to  the  proof  we  would  like  to  discuss  the 
significance  of  our  theorem.  First,  observe  that  the  assumptions  of 
the  theorem  are  readily  satisfied  if  the  cardinality  of  I  is  one.  This 
is  simply  because  an  execution  of  a  parallel  program  on  some  input  x 
results  in  at  most  one  common  memory  access  at  each  time  by  each 
processor.  Problem.  Find  instances  where  "common  memory  access 
patterns"  of  a  program  S,  for  a  set  of  inputs  I,  are  the  same  (or  about 
the  same)  for  all  the  inputs  in  I. 

It turns out that researchers in the field of numerical
computations identified the notion of serial straight-line programs,
which characterizes many of the known programs for problems in this
field. For a definition of serial straight-line programs see [AHU-74],
Section  1.5.  Serial  straight-line  programs  for  inputs  of  size  n  do  not 
include  branching,  loops,  or  indirect  addressing.  Therefore,  for  all 
inputs  of  size  n  and  for  each  time  unit  of  such  a  program  the  same 
registers  are  being  accessed. 

Heller [He-78] includes references to numerous numerical parallel
algorithms. Many of these algorithms satisfy such a "uniform" (local and
common) memory access pattern property, including algorithms for
evaluating arithmetic expressions of a given format (see [W-75]), the
"naive" matrix multiplication, the "naive" raising of an n x n matrix to
the n-th power (in particular, transitive closure; see [SJ-81]) and
others. So, our theorem is applicable to these programs. Note,
however, that for our theorem we may dispense with the uniform local
memory access pattern property and relax slightly the uniform common
memory property, as long as no more than one common memory address can
be accessed by each processor at each cycle t, 1 ≤ t ≤ T.

Proof of the theorem. Let us call the time units of S cycles and the
time units of S' pulses. Assume, w.l.o.g., that N/p is an integer.
Otherwise, we "add" some dummy common memory addresses in order to
increase N to the next multiple of p. Let P_1,P_2,...,P_p (resp.
R_1,R_2,...,R_p) be the processors of the EREW PRAM of S (resp. S').
Let x_i^k, 1 ≤ i ≤ m_k, be the local registers of processor P_k,
1 ≤ k ≤ p, and w_j, 1 ≤ j ≤ N, be the common memory locations of S. We
set \bar{x}_i^k, 1 ≤ i ≤ m_k, to be local memory locations of processor
R_k, which correspond, respectively, to local registers x_i^k,
1 ≤ i ≤ m_k, of processor P_k, for 1 ≤ k ≤ p. Let u_j, 1 ≤ j ≤ p, be the
common memory locations of S'. Set y_j^k, 1 ≤ k ≤ p, 1 ≤ j ≤ N/p, to be
local registers of processor R_k (in addition to \bar{x}_1^k,
\bar{x}_2^k, ...).

Generally speaking, we design S' in such a way that processor R_k
simulates the behavior of processor P_k, 1 ≤ k ≤ p; each local memory
location \bar{x}_i^k simulates x_i^k; and the locations of the form u_j
and y_j^k simulate the w_i locations. The additional O(T) local memory
locations are required for the code of S' as explained at the end.



By our assumption no more than p of the w_i locations may be
accessed at each cycle of S. Denote the w_i locations which may be
accessed at cycle t, 1 ≤ t ≤ T, by v_{t_1},v_{t_2},...,v_{t_p}. (In a
cycle where fewer than p w_i's may be accessed we set some of these
v_{t_j}'s to represent w_i's which are not accessed by any processor for
any input during this cycle.) In any case, the locations
v_{t_1},v_{t_2},...,v_{t_p} are p distinct common memory locations.


A  high-level  description  of  S' 

The following condition is satisfied just before the simulation of
cycle t, 1 ≤ t ≤ T, of S by S' starts:

(*) Each processor R_k, 1 ≤ k ≤ p, keeps the content of exactly one of
the variables v_{t_1},v_{t_2},...,v_{t_p} in a local memory location of
the form y^k; and no more than N/p of the w_i locations of S are stored
in the local memory of each processor R_k, 1 ≤ k ≤ p.

Every cycle t of S is simulated by S' in three pulses:

(1) The fetch pulse. Processor R_k which keeps the contents of
variable v_{t_j} in its y^k local variable assigns it into u_j.

(2) The "real-thing" pulse.
If processor P_k performs in cycle t an instruction which relates
   to its local registers only (or remains idle)
Then processor R_k does the same with respect to its
     corresponding \bar{x}^k registers
Else (processor P_k performs an instruction of the form:
         x_i^k ← v_{t_j}  - read from memory, or
         v_{t_j} ← x_i^k  - write into common memory)
     processor R_k performs the same replacing v_{t_j} by u_j
     and x_i^k by \bar{x}_i^k.

(3) The store pulse. Processor R_k copies the contents of some (one)
u_j into one of its y^k variables so that condition (*) will hold for
every cycle which follows.
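
The three pulses can be sketched as follows. The representation is an
assumption made for illustration only: indices are 0-based, the plans
fetch_plan and store_plan (hypothetical names) say which local
y-variable each processor moves to or from which common cell, an
instruction that touches local registers only is passed as None and is
not executed, and the Python loops stand in for the p processors acting
in parallel.

```python
def simulate_cycle(u, y, x, fetch_plan, instrs, store_plan):
    """One cycle of S simulated by three pulses of S' (0-based indices).

    u[j]          -- the p common memory cells of S'
    y[k], x[k]    -- local y-variables and registers of processor R_k
    fetch_plan[k] -- (slot, j): R_k copies y[k][slot] into u[j]
    instrs[k]     -- None, ('read', j, reg) or ('write', j, reg) for R_k
    store_plan[k] -- (j, slot): R_k copies u[j] into y[k][slot]
    """
    p = len(u)
    # Exclusive access: each u[j] is touched by exactly one processor
    # in the fetch pulse and by exactly one in the store pulse.
    assert sorted(j for _, j in fetch_plan) == list(range(p))
    assert sorted(j for j, _ in store_plan) == list(range(p))

    # (1) Fetch pulse.
    for k in range(p):
        slot, j = fetch_plan[k]
        u[j] = y[k][slot]

    # (2) "Real-thing" pulse: the instruction of P_k with v_{t_j}
    # replaced by u[j].
    for k in range(p):
        if instrs[k] is None:
            continue
        kind, j, reg = instrs[k]
        if kind == 'read':
            x[k][reg] = u[j]
        else:
            u[j] = x[k][reg]

    # (3) Store pulse.
    for k in range(p):
        j, slot = store_plan[k]
        y[k][slot] = u[j]
```

The sketch takes the two plans as given; choosing them so that
condition (*) keeps holding is exactly what the rest of the proof is
about.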


The remainder of the proof is devoted to showing that there is a
way to partition initially the w_i's among the local memories of the R_k
processors, and perform the store pulses of all cycles such that
condition (*) is satisfied before the simulation of each cycle. This
is done by reducing our problem to an edge-coloring problem on a
bipartite graph.

Consider an auxiliary digraph G which is defined as follows.

(a) It has (T + 2N/p) x p vertices. T x p of these vertices represent
the common memory locations v_{t_j}, 1 ≤ t ≤ T, 1 ≤ j ≤ p. N of these
vertices, denoted v_{t_j}, -(N/p)+1 ≤ t ≤ 0, 1 ≤ j ≤ p, represent each
of the w_i, 1 ≤ i ≤ N. They are called input vertices. The last N
vertices, denoted v_{t_j}, T+1 ≤ t ≤ T+N/p, 1 ≤ j ≤ p, also represent
each of the w_i, 1 ≤ i ≤ N. They are called output vertices.

(b) There exists an edge of the form v_{t_j} → v_{s_l} if
(1) both v_{t_j} and v_{s_l} stand for the same w_i; and
(2) s > t and there is no v_{r_h} such that s > r > t and v_{r_h}
stands for w_i.

It should be obvious that the out-degree of each v_{t_j},
-(N/p)+1 ≤ t ≤ T, 1 ≤ j ≤ p, is one and the out-degree of the other
vertices is zero, and the in-degree of each v_{t_j}, 1 ≤ t ≤ T+N/p,
1 ≤ j ≤ p, is one and the in-degree of the other vertices is zero.

Layer t of G (L_t in short) is the set {v_{t_j} | 1 ≤ j ≤ p},
-(N/p)+1 ≤ t ≤ T+N/p. The correspondence between layers 1,2,...,T and
cycles should be obvious.
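
The construction of G can be sketched directly from an access schedule.
The sketch assumes 0-based cell identifiers and represents the vertex
v_{t_j} by the pair (t, j); the way the cells are spread over the input
and output layers (in increasing order, p per layer) is an arbitrary
choice made here, since any spreading serves the definition.

```python
def build_graph(schedule, n_cells, p):
    """Return (layers, edges) of the auxiliary digraph G.

    schedule[t-1] -- the p distinct cells (0-based ids) that may be accessed
                     at cycle t, i.e. v_{t_1}, ..., v_{t_p}
    A vertex is a pair (t, j): position j of layer t.
    """
    assert n_cells % p == 0, "N is assumed to be a multiple of p"
    q, T = n_cells // p, len(schedule)
    layers = {}
    for r in range(q):
        block = list(range(r * p, (r + 1) * p))
        layers[-q + 1 + r] = block   # input layers -(N/p)+1, ..., 0
        layers[T + 1 + r] = block    # output layers T+1, ..., T+N/p
    for t, cells in enumerate(schedule, start=1):
        assert len(set(cells)) == p, "each layer holds p distinct cells"
        layers[t] = list(cells)

    edges, last_vertex = [], {}
    for t in sorted(layers):
        for j, cell in enumerate(layers[t]):
            if cell in last_vertex:
                # Consecutive occurrences of the same w_i are joined by an edge.
                edges.append((last_vertex[cell], (t, j)))
            last_vertex[cell] = (t, j)
    return layers, edges
```

Every vertex outside the input layers receives exactly one edge, so the
sketch produces Tp + N edges, in agreement with the degree count above.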


Our solution assigns each edge of the form v_{t_j} → v_{s_l} to a
processor R_k. This implies that processor R_k stores the content of u_j
into a local variable of the form y^k at the store pulse of cycle t (if
t ≤ 0, then a y^k variable contains the input value that corresponds to
v_{t_j}); later, at the fetch pulse of cycle s processor R_k assigns the
content of y^k into u_l (if s > T, then a y^k variable contains the
output value that corresponds to v_{s_l}).

In order to satisfy the (*) condition throughout the simulation it
clearly suffices to do the following. Partition the edges of G
into p sets C_1,C_2,...,C_p such that for any two edges e_1 and e_2 of
the same set both: (1) tail(e_1) and tail(e_2) belong to different
layers, and (2) head(e_1) and head(e_2) belong to different layers. This
partitioning enables us to associate each of these sets with a
processor which will do the work corresponding to edges of this set.
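
Conditions (1) and (2) are easy to check mechanically. The following
sketch assumes that a vertex is given as a (layer, position) pair and
that owner[i] (a hypothetical name) is the index of the set, and hence
of the processor, chosen for the i-th edge:

```python
def is_valid_partition(edges, owner):
    """Check conditions (1) and (2) for an assignment of edges to processors.

    edges -- list of (tail, head), each vertex being a (layer, position) pair
    owner -- owner[i] is the set (processor) index chosen for edges[i]
    """
    seen_tail_layers, seen_head_layers = set(), set()
    for (tail, head), k in zip(edges, owner):
        if (k, tail[0]) in seen_tail_layers or (k, head[0]) in seen_head_layers:
            return False
        seen_tail_layers.add((k, tail[0]))
        seen_head_layers.add((k, head[0]))
    return True
```

Two edges that leave the same layer must therefore go to different
processors, while a chain of edges through successive layers may stay
with one processor.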

Still, a further simplification of the problem is possible.
Consider another auxiliary graph H. H is a bipartite undirected graph.
Note that H may include parallel edges. Let {a_1,a_2,...,a_{T+N/p}} and
{b_{-(N/p)+1},...,b_T} be the two disjoint sets of vertices of H. The
connection to the digraph G becomes clear through the definition of the
edges of H. There is a one-to-one correspondence between the edges of G
and the edges of H. Let v_{t_j} → v_{s_l} be an edge of G. Then, the
corresponding edge in H is of the form (b_t,a_s). Our edge partitioning
problem for G translates into the following edge partitioning problem
for the undirected graph H. Partition the edges of H into p sets such
that no two edges of the same set share an end point.

This is the well known edge coloring problem for a bipartite
graph. Since the degree of each vertex in H is not greater than p, a
known theorem (see [O-67]) implies that it is possible to partition the
edges of H into p sets as required.
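
A minimal sequential sketch of such a partition is given below. It uses
the classical alternating-path argument for bipartite multigraphs and is
not the algorithm of any of the papers cited here; the function name and
the edge representation (a list of (b, a) endpoint pairs, the two sides
being kept in separate tables) are assumptions made for illustration.

```python
from collections import defaultdict

def edge_color_bipartite(edges, num_colors):
    """Properly color the edges of a bipartite multigraph.

    edges      -- list of (left, right) endpoint pairs; parallel edges allowed
    num_colors -- must be at least the maximum vertex degree
    Returns a list giving the color (0 .. num_colors-1) of every edge.
    """
    at_left = defaultdict(dict)    # left vertex  -> {color: edge index}
    at_right = defaultdict(dict)   # right vertex -> {color: edge index}
    color = [None] * len(edges)

    for idx, (u, v) in enumerate(edges):
        a = next(c for c in range(num_colors) if c not in at_left[u])
        b = next(c for c in range(num_colors) if c not in at_right[v])
        if a in at_right[v]:
            # Collect the path leaving v whose edges alternate a, b, a, ...
            path, vertex, on_right, c = [], v, True, a
            while True:
                table = at_right if on_right else at_left
                e = table[vertex].get(c)
                if e is None:
                    break
                path.append(e)
                vertex = edges[e][0] if on_right else edges[e][1]
                on_right = not on_right
                c = b if c == a else a
            # Swap a and b along the path; this frees a at v and never
            # reaches u, because u has no edge colored a.
            for e in path:
                del at_left[edges[e][0]][color[e]]
                del at_right[edges[e][1]][color[e]]
            for e in path:
                color[e] = b if color[e] == a else a
                at_left[edges[e][0]][color[e]] = e
                at_right[edges[e][1]][color[e]] = e
        color[idx] = a
        at_left[u][a] = idx
        at_right[v][a] = idx
    return color
```

Applied to the edges (b_t,a_s) of H with num_colors = p, the color of an
edge can be read as the index k of the processor R_k that handles it.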

Algorithms  that  achieve  this  partitioning:  We  refer  the  reader  to 
[GK-82]  for  sequential  algorithms  and  [LPV-81]  for  parallel  algorithms. 

We would like to ascertain that the proof of the theorem is
completed. The set (color) of the edge in H corresponding to an edge
of the form v_{t_j} → v_{s_l}, where -(N/p)+1 ≤ t ≤ 0, yields a
processor R_k; now, the contents of the w_i that corresponds to this
edge is initially in one of its y^k locations. We need exactly N/p y^k
locations, for 1 ≤ k ≤ p, for this initialization. At each fetch pulse
of the simulation of a cycle t, 1 ≤ t ≤ T, we "release" one y^k,
1 ≤ k ≤ p. This released y^k can be used to store the w_i that has to be
stored by processor R_k as a result of the store pulse that follows,
1 ≤ k ≤ p. The introduction of the v_{t_j}'s, T+1 ≤ t ≤ T+N/p, actually
gives an "equal" partition of the output w_i's which was not "promised"
in the theorem. They are not necessary for the proof.

For each cycle t, 1 ≤ t ≤ T, the code of S' at each processor R_k must
specify the y^k to be released and reoccupied, the u_j into which this
y^k is copied in the fetch pulse, the u_j (if any) that may be accessed
in the real-thing pulse and the u_j copied into y^k in the store pulse.
Thus, the code of S' is longer by O(T) than the code of S at each local
memory.

Extensions.  All  the  results  in  this  paper  can  be  extended  in  a 
straightforward manner to more permissive models of parallel
computation where simultaneous access of several processors to the same
memory  location  is  allowed;  in  particular,  the  powerful  concurrent-read 
concurrent-write  (CRCW)  PRAM  allows  several  processors  to  read  (or 
write)  simultaneously  from  (into)  the  same  memory  location.  See 
[SV-82]  for  more  on  these  models  of  computation. 

Acknowledgement. A referee pointed out the need for the additional O(T)
local memory locations in the main theorem. We are grateful for this
remark.